Introduction:

This report describes work done by Alexandra Zoltan-Jones for the period July 1, 2003 to
June 30, 2004. The grant was recently transferred to Silva Krause.

Task  1:  To  examine  the  effects  of  perturbing  hyaluronan  levels  on  mammary  tissue  morphology 
and  polarization  in  a  three-dimensional  system. 

Unfortunately, I was unable to set up the three-dimensional culture system satisfactorily in
our laboratory, despite assistance from the laboratory of Dr. Mina Bissell. After much discussion
with Dr. Bissell and other members of my advisory committee, it was decided that this culture
system was not a suitable undertaking for a graduate student, due to intense technical and time
requirements. In Dr. Bissell's lab, a specialized technical investigator carries out this technique.

Task 2: To examine the effects of increased hyaluronan on regulation of cell proliferation via
phosphoinositide 3-kinase/PTEN, ILK and β-catenin.

Originally  the  system  described  under  Task  1  was  to  be  used  for  this  purpose.  Instead  I  used 
standard  cultures  of  MCF-10A  mammary  epithelial  cells  as  well  as  MDCK  kidney  epithelial 
cells,  which  are  the  ‘gold  standard’  system  for  molecular  and  cellular  studies  of  epithelial- 
mesenchymal  transformation.  Using  these  cells  we  studied  the  effects  of  up-regulating 
endogenous  hyaluronan  synthesis  on  changes  in  cell  characteristics  associated  with  epithelial- 
mesenchymal  transformation,  including  cell  proliferation  and  invasion. 

Using recombinant adenoviral expression of hyaluronan synthase 2, I showed that increased
hyaluronan production promotes anchorage-independent growth and invasiveness, induces
gelatinase production, and stimulates phosphoinositide 3-kinase/Akt pathway activity in
phenotypically normal MDCK canine kidney and MCF-10A human mammary epithelial cells.
Cells infected with hyaluronan synthase 2-adenovirus also acquire mesenchymal characteristics,
including up-regulation of vimentin, dispersion of cytokeratin, and loss of organized adhesion
proteins at intercellular boundaries. Furthermore, I showed that transforming effects of two
well-described agents, hepatocyte growth factor and β-catenin, are dependent on hyaluronan-cell
interactions. Perturbation of endogenous hyaluronan polymer interactions by treatment with
hyaluronan oligomers, which antagonize constitutive hyaluronan-CD44 interactions, was shown
to reverse the transforming effects of hepatocyte growth factor and β-catenin in MDCK canine
kidney and MCF-10A human mammary epithelial cells. Also, hepatocyte growth factor and
β-catenin induce assembly of hyaluronan-dependent pericellular matrices similar to those
surrounding mesenchymal cells. Thus increased expression of hyaluronan is sufficient to induce
epithelial to mesenchymal transition and acquisition of transformed properties in phenotypically
normal epithelial cells. This study is published (Zoltan-Jones et al., 2003).

In  collaborative  studies  with  other  members  of  the  lab,  I  also  provided  evidence  that  the 
extracellular  matrix  metalloproteinase  inducer,  emmprin,  stimulates  hyaluronan  production  in 
human  mammary  tumor  cells,  and  promotes  anchorage-independent  growth  and  drug  resistance 
in  a  hyaluronan  dependent  manner  (Misra  et  al.,  2003;  Marieb  et  al.,  2004).  Using  this 
information,  I  went  on  to  demonstrate  that,  in  weakly  malignant  MDA-MB-436  human 
mammary  carcinoma  cells  and  in  phenotypically  normal  MCF-10A  human  mammary  epithelial 
cells,  over-expression  of  emmprin  stimulates  both  hyaluronan  production  and  expression  of 
CD44,  a  major  cell-surface  hyaluronan  receptor.  I  also  showed  that  elevated  emmprin  in  MCF- 
10A  cells  promotes  their  anchorage-independent  growth  in  soft  agar  and  increases  activity  of  the 
phosphoinositide  3-kinase  /Akt,  MAP  kinase  and  focal  adhesion  kinase  cell  survival  signaling 
pathways. These effects are reversed by perturbation of endogenous hyaluronan-cell interactions
by  treatment  with  HA  oligomers.  Furthermore,  I  showed  that  emmprin-stimulated,  hyaluronan- 
CD44  interactions  regulate  upstream  events  that  are  known  to  control  these  cell  survival 
pathways,  specifically  CD44-ezrin  and  CD44-ErbB2  interactions,  as  well  as  activation  of  ErbB2. 
Treatment  with  hyaluronan  oligomers  causes  dissociation  of  these  complexes,  thus  indicating 
that  endogenous  hyaluronan  polymer  is  necessary  for  the  CD44-ezrin  and  CD44-ErbB2 
interactions.  This  work  is  being  prepared  for  publication. 

Task  3:  To  examine  the  effects  of  hyaluronan  on  focal  adhesion  kinase  activity. 

As part of the work described under Task 2, I also showed that increased hyaluronan
production stimulates focal adhesion kinase activity in weakly malignant MDA-MB-436 human
mammary carcinoma cells and in phenotypically normal MCF-10A human mammary epithelial
cells, and that treatment with hyaluronan oligomers, which antagonize constitutive
hyaluronan-CD44 interactions, reverses this effect.

Key  accomplishments 

•  Showed  that  increased  hyaluronan  is  sufficient  for  epithelial-mesenchymal 
transformation  in  phenotypically  normal  epithelial  cells 

• Showed that induction of epithelial-mesenchymal transformation by hepatocyte growth
factor and β-catenin is dependent on hyaluronan

•  Showed  that  increased  hyaluronan  production  causes  increased  phosphoinositide  3-kinase 
/Akt  pathway  activity 

•  Assisted  in  studies  showing  that  emmprin  stimulates  hyaluronan  production,  and 
promotes  anchorage-independent  growth  and  drug  resistance  in  a  hyaluronan  dependent 
manner 

•  Showed  that  emmprin  stimulates  ErbB2  signaling  and  phosphoinositide  3-kinase  /Akt, 
MAP  kinase  and  focal  adhesion  kinase  cell  survival  signaling  pathways  in  a  hyaluronan 
dependent  manner 

Outcomes 

•  Completed  Ph.D.  in  Cell,  Molecular  and  Developmental  Biology  at  Tufts  University 

•  Published  three  manuscripts  -  see  references 

•  Obtained  Postdoctoral  position  studying  breast  cancer  at  National  Cancer  Institute 

Sequences that are present in a given species or strain while absent from or different
in any other organisms can be used to distinguish the target organism from other
related or un-related species. Such DNA signatures are particularly important for
the identification of the genetic source of drug resistance of a strain or for the
detection of organisms that can be used as biological agents in warfare or terrorism.
Most approaches used to find DNA signatures are laboratory based, require a great
deal of effort and can only distinguish between two organisms at a time. We
propose a more efficient and cost-effective bioinformatics approach that allows
identification of genomic fingerprints for a target organism. We validated our
approach with a custom microarray built from sequences identified as DNA fingerprints
of Bacillus anthracis. Hybridization results showed that the sequences found using
our algorithm were truly unique to B. anthracis and were able to distinguish
B. anthracis from its close relatives B. cereus and B. thuringiensis.


1.  Introduction 

The area of organism identification using DNA sequences has many applications
in various life science areas. However, there are also many challenges.
For instance, sheep pox and goat pox viruses are so closely related
that they cannot be distinguished using clinical signs, pathogenesis or
sero-reactivity.28 Furthermore, both cross-infectivity and cross-resistance have
been reported36 to the point that the two were thought to be caused by
a single viral species. However, genetic analysis demonstrated that sheep
pox and goat pox are actually caused by two related, but genetically distinct
viruses. Furthermore, the identification of a few base pair differences
in the sequence coding for the P32 protein allowed the design of a polymerase
chain reaction (PCR) restriction fragment length polymorphism
(PCR RFLP) assay able to distinguish between the two species. This assay
involves a PCR amplification with a common primer, followed by a digestion
with a HinfI restriction enzyme that produces fragments of different
sizes allowing the identification of the two species.

† These authors should be considered joint first authors.

The issue of distinguishing between different species is somewhat academic
if the two species exhibit both cross-infectivity and, most importantly,
allow passive cross-protection as the sheep pox and goat pox do.35
However, this is not always the case. Genes that are present in certain
isolates of a given bacterial species and are substantially different or absent
from others can determine important strain-specific traits such as drug
resistance13 and virulence.51 As an example, B. anthracis, B. cereus, and
B. thuringiensis are genetically so close that it has been proposed to consider
them a single species.25 At the same time, these bacteria are very different
on a phenotypic level. B. cereus is a frequent food contaminant but only a
mild opportunistic human pathogen;10,26 B. thuringiensis is actually a useful
bacterium being used as a pesticide46 while B. anthracis is a virulent
pathogen for mammals that has been used as a bio-terror and biological
warfare agent.12,53

In such cases, the identification of an organism-specific DNA sequence
gains an increased importance. Even if such sequences are not functionally
active, they can still be extremely useful if used as genetic fingerprints.
DNA sequences that are present in a given species while absent from any
other organisms can be used to distinguish the target organism from other
related or un-related species. If such genetic fingerprints were available for
organisms that can be potentially used as biological or terrorist weapons,
the task of rapid threat identification, characterization, and selection of
appropriate medical countermeasures could be immensely facilitated. Genetic
fingerprints can also aid identification of the genetic source of drug resistance
of a strain,17 which can be useful to drug developers in pharmacogenomics.


2. Existing work

The existing work in the area of organism identification using DNA signatures
can be divided into two different categories. One approach uses a
laboratory assay to identify the organism. Techniques used include amplified
fragment length polymorphism (AFLP),44,45 suppression subtractive
hybridization (SSH)3 and custom DNA microarrays.34 A second approach
uses a purely bioinformatics analysis of the characteristics of the genomes
of various species and extracts those features that are characteristic of
individual species.

The laboratory based approach does not necessarily require information
about the entire genomes involved and is better suited for the development
of assays for monitoring and identification of biological threats. For instance,
SSH, a PCR-based DNA subtraction method, allows identification
of genomic sequence differences in a "tester" DNA relative to a "driver"
DNA. AFLP relies on the analysis of a fluorescence based signal proportional
to the size of various DNA fragments.49 SSH and AFLP have been
successfully used to identify genomic sequence differences between various
strains or species of bacteria.4,5,10,29,44 The major drawback of this approach
is that it permits identification of genomic differences only between
two organisms. For instance, in order to differentiate two species, one needs
to use an SSH assay to compare each strain of one species with each strain
of the other species.44 Clearly, this approach cannot be used to provide a
genomic signature that would differentiate a given organism from all others.

The in silico approach to identifying genomic signatures is usually based
on an analysis of the entire genomes involved and aims at extracting features
such as species-specific codon usage.1,2,21,30-32,52 While this type of
genomic signature can be informative about the given organisms and the
relationships among them, it may not be directly usable for detection and
monitoring purposes.

Comparative sequence analysis has also been useful in detecting intronic
and intergenic regions23,38 as well as uncovering novel repeated
structures.18,24 Several genome scale alignment tools are available: MUMmer,14,15,37
AVID,11 MGA,27 WABA,33 and GLASS7 among others. TaxPlot41
provides visual representation of protein homologs in microbial and
eukaryotic genomes. Most of these pair-wise[a] alignment tools assume that
the input genomes are closely related. Therefore, there will be a mapping
of large subsequences between the two input genomes. In turn, they assume
that these large subsequences, appearing in the same order in the closely
related genomes, are very likely to be part of the final alignment. These
regions are used as anchors for the alignment of the input genomes.

[a] MGA is a multiple alignment tool but the alignment is still computed pair-wise.

In general, anchor-based genome alignment programs first create a suffix
tree from the two input genomes. A suffix tree is a compact representation
of all suffixes in the input string.39,54 A suffix of a string is a substring
starting at any position in the string and extending up to the end of the
string. Next, the suffix tree is searched for sequences that appear in both
input genomes. These exact matching subsequences are known as maximal
exact matches (MEMs). The anchors are chosen from these MEMs.
Different programs apply different criteria for the selection of anchors. For
instance, MUMmer uses the longest increasing subsequence (LIS)22 for the
selection of anchors.14 MUMmer allows the selection of overlapping anchors
whereas AVID and MGA only select non-overlapping anchors. Since
MGA allows alignment of more than two genomes, it only selects MEMs
that are present in all of the input genomes. AVID first finds the length
of the longest MEM and discards all the MEMs that are less than half the
length of the longest MEM. After selecting the anchors, MUMmer employs
a variant of the Smith-Waterman algorithm47 to close the gaps between
the anchors. MGA and AVID close the gaps by recursively creating suffix
trees for the non-anchored parts of the input genomes, gradually
reducing the gap sizes. Once the gaps are smaller than a threshold,
MGA and AVID close them using the ClustalW48 and Needleman-Wunsch
algorithms,42 respectively.
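
To make the LIS-based anchor selection concrete, the following is a minimal sketch, not MUMmer's actual code. It assumes each exact match is given as a hypothetical tuple (position in genome A, position in genome B, length) and ignores match lengths and overlaps when choosing the chain. It keeps the largest set of matches whose positions increase in both genomes.

```python
from bisect import bisect_left


def select_anchors(matches):
    """Return the longest chain of matches that is increasing in both genomes.

    `matches` is a list of (pos_a, pos_b, length) tuples; this layout is an
    assumption made for illustration only.
    """
    # Sort by position in genome A; ties are ordered by decreasing pos_b so
    # that two matches sharing pos_a can never both enter the chain.
    ordered = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails = []       # tails[k]: smallest pos_b ending a chain of length k + 1
    tails_idx = []   # index into `ordered` of the match stored in tails[k]
    prev = [-1] * len(ordered)  # back-pointers for reconstructing the chain
    for i, (_, pos_b, _) in enumerate(ordered):
        k = bisect_left(tails, pos_b)
        if k == len(tails):
            tails.append(pos_b)
            tails_idx.append(i)
        else:
            tails[k] = pos_b
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


example = [(10, 100, 30), (50, 20, 25), (90, 150, 40), (130, 60, 35), (170, 200, 50)]
anchors = select_anchors(example)
```

Matches left out of the chain appear in a different order in the two genomes and would be handled in the gap-closing step.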

All of these tools are geared towards finding large-scale
similarities between two or more genomes. Our focus here is different.
While these algorithms were developed to find sequence similarities, our
goal is to find sequence dissimilarities. These two problems are related but
not reciprocal. Simply put, one cannot just take the complement of the
sequences found in a similarity search and use them as genomic signatures.
The main reason is that a search aiming to find similarity
will sometimes discard entire blocks after only a summary inspection because
they are not sufficiently similar to the target sequence. On the other
hand, a search aiming to find dissimilarities, i.e., unique signatures, has to
focus on exactly those areas that are discarded without extensive
analysis during the similarity search.

Here, we propose an algorithm for finding genomic fingerprints that
distinguish an organism from all other organisms with known genomes.
As the number of sequenced organisms increases, this approach has the
potential to replace existing laboratory based approaches such as AFLP
and SSH.

In this paper, we used this approach to find a genetic signature for
B. anthracis. Identification of genomic regions unique to B. anthracis can
provide clues to its genetic relationship to other highly similar organisms.
Related work for the detection of B. anthracis used plasmid-encoded toxin
genes for rapid DNA-based assays.8 However, these failed to detect
non-plasmid containing strains of B. anthracis isolated from the environment.50
Also, there have been efforts to design real-time PCR assays. However,
these assays only targeted a single locus and they yielded false-positive
results with some strains of B. cereus.20,43


3.  Analysis  methods 

Our goal is to find unique DNA sub-sequences for a given target genome
across all available known genomes. An obvious approach is to compare
(i.e., align) the genome of interest against all available known genomes.
These alignments will reveal the parts of the target genome that do not align
with any other genome (i.e., are unique to the target genome). However,
this seemingly simple approach is computationally very expensive. The
GenBank database at NCBI contains nucleotide sequences from more than
140,000 organisms.9 The lengths of these genomes vary from a few thousand
base pairs to a few billion base pairs. Aligning the input genome with each
of these genomes is computationally infeasible.

The amount of computation can be considerably reduced by using the
phylogenetic background of the target. Today biologists agree that various
organisms have evolved from common ancestors. During evolution, functional
genomic elements are conserved. Hence, two closely related genomes
are expected to have many matching subsequences. If a subsequence that
distinguishes the target from all organisms exists, this subsequence will also
distinguish the target from its closest relative. Hence, a good initial set of
potential genomic signatures can be obtained by comparing the target only
with its closest relative and by retaining only those sequences that are different.
Subsequently, each of these potential signatures is compared with all
other known genomes. This approach drastically reduces both the number
of comparisons required and the length of sequences to be compared
(from a few million to a few thousand base pairs, at most).

In order to find the exact matching sequences between the target and its
closest relative, we start by using their concatenated sequences to create a
suffix tree. We then use a suffix tree search algorithm such as the one employed
in MUMmer to find the exact matching sequences in both genomes. Since
our goal is to determine a set of relatively short sequences to be used on a
microarray type assay, we have to search both the forward and the reverse
strands. Any sequences that match between the two organisms are removed
from further consideration. The result is a set of short segments of the
target genome that can be considered potential signatures. These are then
compared with all sequences in the blast-nt40 database from NCBI.6 We
consider a sequence unique for the target genome if it does not align to
any sequence from any other organism with an expected value (E-value)
less than a threshold of 0.01. Fig. 1 provides an overview of this approach.
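
The match-removal step can be illustrated with a small sketch. It is a simplification under stated assumptions: instead of a suffix tree it uses a hash set of fixed-length words (k-mers) from the related genome and its reverse complement, which marks the same target positions as the union of all exact matches of length k or more but would not scale to the memory-efficient search a suffix tree provides. The function and parameter names (`candidate_segments`, `min_match`, `min_segment`) are hypothetical, and sequences are assumed to be upper-case A, C, G, T only.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def candidate_segments(target, relative, min_match=30, min_segment=50):
    """Return (start, end, sequence) for target regions with no exact match.

    A target position is masked if it lies inside any exact match of at least
    `min_match` bases with `relative` on either strand. Unmasked runs shorter
    than `min_segment` are discarded. Coordinates are 0-based, end-exclusive.
    """
    k = min_match
    words = set()
    for strand in (relative, reverse_complement(relative)):
        for i in range(len(strand) - k + 1):
            words.add(strand[i:i + k])

    # Mark every target position covered by a shared k-mer.
    covered = [False] * len(target)
    for i in range(len(target) - k + 1):
        if target[i:i + k] in words:
            for j in range(i, i + k):
                covered[j] = True

    # Collect the uncovered runs; the trailing True closes the last run.
    segments = []
    start = None
    for i, is_covered in enumerate(covered + [True]):
        if not is_covered and start is None:
            start = i
        elif is_covered and start is not None:
            if i - start >= min_segment:
                segments.append((start, i, target[start:i]))
            start = None
    return segments


# Toy example with short thresholds so the result can be checked by hand.
toy = candidate_segments("GATTACATTTTTTTGTAATC", "GATTACAGG",
                         min_match=4, min_segment=5)
```

In the toy call the first seven bases match the related sequence directly and the last seven match its reverse strand, so only the middle run of T's remains as a candidate.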


[Figure 1 (image not reproduced): schematic in which the target genome is compared with a
closely related genome and the remaining target segments (A1, A2, ..., An) are searched as
candidate unique sequences against the sequence database. Two sample BLAST outputs list the
sequences producing significant alignments together with their scores (bits) and E values.]


Figure 1. The genomic fingerprinting approach. Two genomes are searched for exact
matching subsequences (MEMs). The MEMs are removed from the target genome and
the remaining segments of the target genome (A1, A2, ..., An) are searched against
the nt database. If the length of a segment is less than the user specified length, it is
discarded and not searched in the nt database. As shown, if a sequence does not align
with any sequence from another organism with E value less than the specified threshold
it is considered as a sequence unique to the target genome.
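
The uniqueness test in the last step can be sketched as a filter over BLAST hits. This is an illustration only: it assumes the hits have already been parsed into hypothetical (query id, organism, E value) tuples, and the identifiers and values in the example are made up rather than taken from the experiment.

```python
def unique_signatures(candidate_ids, hits, target_organism, e_threshold=0.01):
    """Keep candidates with no significant hit outside the target organism.

    `hits` is an iterable of (query_id, organism, e_value) tuples; this layout
    is assumed for illustration. A candidate is rejected as soon as one hit to
    another organism has an E value below `e_threshold`.
    """
    rejected = {
        query_id
        for query_id, organism, e_value in hits
        if organism != target_organism and e_value < e_threshold
    }
    return [cid for cid in candidate_ids if cid not in rejected]


# Hypothetical hits: seg_A1 only has a weak hit elsewhere, seg_A2 has a strong one.
example_hits = [
    ("seg_A1", "target", 0.0),
    ("seg_A1", "other_species_1", 0.17),
    ("seg_A2", "target", 2e-30),
    ("seg_A2", "other_species_2", 1e-12),
]
kept = unique_signatures(["seg_A1", "seg_A2", "seg_A3"], example_hits, "target")
```

A candidate with no hits at all, such as `seg_A3` here, is retained, since nothing significant aligns to it outside the target.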


4.  Results  and  discussion 

In order to validate our approach, we designed a custom microarray using
sequences identified as genomic fingerprints for B. anthracis. This array
was then hybridized with B. anthracis and B. cereus.

In order to find a genomic signature for B. anthracis we proceeded
as follows. We searched the B. anthracis str. Ames genome (GenBank
contig accession number NC_003997) for subsequences of 30 base pairs or
more matching anywhere (direct and reverse strand) with sequences from
the genome of B. cereus ATCC 14579 (GenBank contig accession number
NC_004722). We chose the B. cereus ATCC 14579 genome as the closely
related genome because it is considered to be a good representative of the
B. cereus family.19 Then, we removed all matching sequences from the
B. anthracis genome. This step produced over 6,000 sequences of length
50 or more. These sequences were then searched against the nt database
using blastn. The sequences in the BLAST output that were not found in
any other organism with E value less than 0.01 were retrieved and considered
part of the genomic fingerprints of B. anthracis. There were 140
such sequences. Note that this analysis stage also removed sequences that
matched the genomes of other close relatives of B. anthracis, such as
B. thuringiensis, without ever directly comparing them. These 140 target
sequences were provided to CombiMatrix (Mukilteo, WA) for the design of a
custom microarray. CombiMatrix designed 2 probes for 80 target sequences
and 1 probe for 22 target sequences (for a total of 182 probes for 102 target
sequences) with melting temperature in the range of 70°C to 75°C and a
length of 35 base pairs or more. Probes of the required length and melting
temperatures could not be identified for the remaining 38 target sequences.
The microarray was designed with three replicates of each of the 182 probes.

The custom microarray was then hybridized with samples of B. anthracis
and B. cereus. The hybridization results showed that 18 probes
hybridized only to the B. anthracis sequences, indicating that they were
true genomic fingerprints of B. anthracis. Table 1 provides the positions of
the sequences on the B. anthracis genome that were found to be unique in the
microarray experiment.

Surprisingly, many of the initial 182 probes also hybridized with
B. cereus. We further searched these cross-hybridizing probes against the
blast-nt database. For the probes that hybridized to B. cereus the results
of this comparison showed that although the target sequences of those
probes are only present in B. anthracis, the part of the target sequence
on which the probes were designed was not unique to B. anthracis and is
present in other genomes. This shows that the probe design stage lost some
specificity due to its added requirements: melting temperatures in
a very narrow range, limited lengths, etc. In all cases, although the initial,
longer sequence was unique across the blast-nt database, by selecting a
shorter subsequence, the probe became unspecific. Hence, another BLAST
search is recommended before printing the assay, to check whether the
subsequences selected as probes continue to be good signatures for the target
organism.


Table 1. The following 18 probes identify 17 unique sequences
of B. anthracis (Ames). The first and second
columns indicate the start and end, respectively, of the target
sequences from B. anthracis. The third and the fourth
columns are the start and end positions, respectively, on the
corresponding target sequences for which probes were designed.


Sequence start   Sequence end   Probe start   Probe end
       175,231        175,455             6          44
       175,567        175,677            36          71
       488,976        489,620           130         166
       945,569        946,596           151         190
     1,629,522      1,630,538           489         523
     1,629,522      1,630,538           529         568
     1,845,001      1,845,363           111         145
     2,021,535      2,022,919           491         529
     2,098,619      2,099,274           591         625
     2,783,190      2,783,405            17          54
     2,918,788      2,920,251           977        1013
     3,037,856      3,038,113           115         152
     3,524,649      3,524,731            17          55
     3,808,069      3,809,046           797         834
     3,821,617      3,822,163           449         483
     4,374,364      4,375,478           227         311
     4,375,581      4,376,123           149         186
     4,933,405      4,933,482             9          43
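
As a consistency check on Table 1, the short sketch below recomputes probe lengths and genome coordinates from the tabulated values. It assumes, as an interpretation not stated in the text, that all positions are 1-based and inclusive and that probe positions are offsets within their target sequence; under that reading the shortest probe is exactly 35 bases, in line with the stated design constraint.

```python
# Rows of Table 1: (sequence start, sequence end, probe start, probe end).
TABLE_1 = [
    (175231, 175455, 6, 44),
    (175567, 175677, 36, 71),
    (488976, 489620, 130, 166),
    (945569, 946596, 151, 190),
    (1629522, 1630538, 489, 523),
    (1629522, 1630538, 529, 568),
    (1845001, 1845363, 111, 145),
    (2021535, 2022919, 491, 529),
    (2098619, 2099274, 591, 625),
    (2783190, 2783405, 17, 54),
    (2918788, 2920251, 977, 1013),
    (3037856, 3038113, 115, 152),
    (3524649, 3524731, 17, 55),
    (3808069, 3809046, 797, 834),
    (3821617, 3822163, 449, 483),
    (4374364, 4375478, 227, 311),
    (4375581, 4376123, 149, 186),
    (4933405, 4933482, 9, 43),
]


def probe_length(row):
    """Probe length in bases, assuming 1-based inclusive probe coordinates."""
    _, _, probe_start, probe_end = row
    return probe_end - probe_start + 1


def probe_genome_interval(row):
    """Probe position on the genome, assuming probe offsets are 1-based."""
    seq_start, _, probe_start, probe_end = row
    return seq_start + probe_start - 1, seq_start + probe_end - 1


probe_lengths = [probe_length(row) for row in TABLE_1]
target_sequences = {(row[0], row[1]) for row in TABLE_1}
```

The two probes that share the target sequence starting at 1,629,522 account for the difference between 18 probes and 17 unique sequences.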

5.  Conclusion 

DNA  sequences  that  are  present  in  a  given  species  or  strain  while  absent 
from  any  other  organism  can  be  used  to  distinguish  the  target  organism 
from  other  related  or  un-related  species.  The  identification  of  such  DNA 
signatures  is  particularly  important  for  organisms  that  may  be  potentially 
used  as  biological  warfare  agents  or  terrorism  threats. 

Most approaches used to identify DNA signatures are laboratory based
and require significant effort and time. A bioinformatics approach can
provide results faster and more efficiently. However, most tools built for
genome comparisons only allow alignment of two genomes at a time. Using
this approach to find unique DNA signatures across all known organisms is
infeasible. In addition, all existing tools are limited to finding the similarity
between two genomes. In contrast, looking for DNA signatures requires the
development of tools that identify sequence dissimilarities. In this paper, we
describe an approach to find the DNA fingerprints of an organism. We used
this approach to find a set of unique sequences for B. anthracis which were
then used to design probes for a DNA microarray. The hybridization results
revealed that a subset of these probes was truly unique to B. anthracis and
was able to distinguish between B. anthracis and B. cereus, which is a close
genetic relative.